Amendments to the Claims:
This listing of claims replaces all prior versions and listings of claims in the application: 

Listing of Claims:

Claims 1-54 (Canceled). 

55. (New) A computer-implemented method for identifying compounds in text, 
comprising: 

extracting a vocabulary of tokens from text; 

iterating from n > 2 down to n = 2 where n decreases by one each iteration and 
in each iteration performing the actions of: 

identifying a plurality of unique n-grams in the text, each n-gram being 
an occurrence in the text of n sequential tokens, each token being found in the 
vocabulary; 

dividing each n-gram into n-1 pairs of two adjacent segments, where 
each segment consists of at least one token; 

for each n-gram, calculating a likelihood of collocation for each pair of 
segments of the n-gram and determining a score for the n-gram based on a lowest 
calculated likelihood of collocation; 

identifying a set of n-grams having scores above a threshold; and 

adding the identified set of n-grams as compound tokens to the 
vocabulary and removing constituent tokens that occur in the added compound tokens 
from the vocabulary. 
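
The steps recited in claim 55 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `find_compounds`, `adjacent_splits`, `pair_score`, and `max_n`, the tuple representation of tokens, and the toy score table are all assumptions introduced for illustration, and the likelihood of collocation is left as a caller-supplied function.

```python
from typing import Callable

Segment = tuple[str, ...]
PairScore = Callable[[Segment, Segment], float]  # hypothetical scoring hook


def adjacent_splits(ngram: Segment) -> list[tuple[Segment, Segment]]:
    """Return the n-1 ways to cut an n-gram into two adjacent, non-empty segments."""
    return [(ngram[:k], ngram[k:]) for k in range(1, len(ngram))]


def find_compounds(tokens: list[str], max_n: int, threshold: float,
                   pair_score: PairScore) -> set[Segment]:
    """Iterate n = max_n down to 2, promoting high-scoring n-grams to compounds."""
    # Vocabulary entries are tuples so single tokens and compounds share one type.
    vocabulary: set[Segment] = {(t,) for t in tokens}
    for n in range(max_n, 1, -1):
        # Unique n-grams whose tokens are all still in the vocabulary.
        candidates = {
            tuple(tokens[i:i + n])
            for i in range(len(tokens) - n + 1)
            if all((t,) in vocabulary for t in tokens[i:i + n])
        }
        # An n-gram's score is the lowest score among its n-1 segment pairs.
        selected = {
            g for g in candidates
            if min(pair_score(a, b) for a, b in adjacent_splits(g)) > threshold
        }
        # Add the compounds, then remove the constituent tokens they contain.
        vocabulary |= selected
        for g in selected:
            vocabulary -= {(t,) for t in g}
    return vocabulary


# Toy usage with a hand-written score table (illustrative values only).
toy_scores = {
    (("new",), ("york", "city")): 5.0,
    (("new", "york"), ("city",)): 3.0,
}
text = "new york city is big new york city is old".split()
vocab = find_compounds(text, max_n=3, threshold=2.0,
                       pair_score=lambda a, b: toy_scores.get((a, b), 0.0))
```

Because the constituent tokens of an accepted 3-gram leave the vocabulary, the later 2-gram pass in this sketch no longer considers fragments such as "new york".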

56. (New) The method of claim 55 where calculating a likelihood of collocation for each 
pair of segments of the n-gram comprises determining a likelihood ratio λ for each 
pair of segments that is computed in accordance with the formula: 

$$\lambda = \frac{L(H_i)}{L(H_c)}$$

where L(H_i) is a likelihood of observing H_i under an independence hypothesis, L(H_c) 
is a likelihood of observing H_c under a collocation hypothesis, and H is a pair of 
segments. 

57. (New) The method of claim 56 where the L(H_c) is computed for each pair of 
segments, t_1, t_2, in each n-gram in accordance with the formula: 

$$\arg\max_{L(H_i)} \frac{L(t_1, t_2 \text{ form compound})}{L(n\text{-gram does not form compound})}$$

58. (New) The method of claim 56 where, for each pair of segments, t_1, t_2, in each 
n-gram, the independence hypothesis comprises P(t_2 | t_1) = P(t_2 | ¬t_1) and the collocation 
hypothesis comprises P(t_2 | t_1) > P(t_2 | ¬t_1). 
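
One common way to evaluate a likelihood ratio of this kind is the binomial likelihood-ratio test used in collocation discovery. The sketch below is offered under several assumptions that the claims do not state: that λ is the ratio L(H_i)/L(H_c), that occurrences are modeled as binomial, that the hypotheses compare the probability of t_2 after t_1 with its probability when t_1 does not precede it, and that the score is scaled as -2 log λ. The function names and the count parameters `c1`, `c2`, `c12`, and `total` are hypothetical.

```python
import math


def _log_binomial(k: int, n: int, p: float) -> float:
    """Log of p^k * (1 - p)^(n - k); the binomial coefficient cancels in the ratio."""
    eps = 1e-12  # keep log() finite when p is exactly 0 or 1
    p = min(max(p, eps), 1.0 - eps)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def log_lambda(c1: int, c2: int, c12: int, total: int) -> float:
    """log(lambda) = log L(H_i) - log L(H_c) for a segment pair (t_1, t_2).

    c1, c2: occurrences of t_1 and of t_2; c12: occurrences of t_1 followed
    by t_2; total: number of positions counted. Requires 0 < c1 < total.
    """
    p = c2 / total                    # H_i: one shared probability for t_2
    p1 = c12 / c1                     # H_c: P(t_2 | t_1)
    p2 = (c2 - c12) / (total - c1)    # H_c: P(t_2 | not t_1)
    log_l_hi = _log_binomial(c12, c1, p) + _log_binomial(c2 - c12, total - c1, p)
    log_l_hc = _log_binomial(c12, c1, p1) + _log_binomial(c2 - c12, total - c1, p2)
    return log_l_hi - log_l_hc


def collocation_score(c1: int, c2: int, c12: int, total: int) -> float:
    """Higher means stronger evidence of collocation (assumed -2 log lambda scaling)."""
    if c12 / c1 <= (c2 - c12) / (total - c1):
        return 0.0  # t_2 is not more likely after t_1, so the inequality in H_c fails
    return -2.0 * log_lambda(c1, c2, c12, total)
```

A function like `collocation_score` could be evaluated on corpus counts for each of the n-1 segment pairs of an n-gram, with the lowest value taken as that n-gram's score.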

59. (New) The method of claim 55 where identifying a plurality of unique n-grams in the 
text comprises skipping n-grams appearing in a list of known compounds. 
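
The skipping step of claim 59 might be sketched as a filter applied while n-grams are collected. The function name `unique_ngrams` and the representation of the known-compound list as a set of tuples are assumptions made for illustration.

```python
def unique_ngrams(tokens: list[str], n: int,
                  known_compounds: set[tuple[str, ...]]) -> set[tuple[str, ...]]:
    """Collect the unique n-grams in the text, skipping any already-known compound."""
    found: set[tuple[str, ...]] = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in known_compounds:
            continue  # already listed as a compound, so it is not scored again
        found.add(gram)
    return found


# Toy usage: the known compound "san francisco" is skipped.
sample = "san francisco bay area near san francisco".split()
bigrams = unique_ngrams(sample, 2, {("san", "francisco")})
```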

60. (New) A computer program product, encoded on a computer-readable medium, 
operable to cause data processing apparatus to perform operations comprising: 

extracting a vocabulary of tokens from text; 

iterating from n > 2 down to n = 2 where n decreases by one each iteration and 
in each iteration performing the actions of: 

identifying a plurality of unique n-grams in the text, each n-gram being 
an occurrence in the text of n sequential tokens, each token being found in the 
vocabulary; 

dividing each n-gram into n-1 pairs of two adjacent segments, where 
each segment consists of at least one token; 

for each n-gram, calculating a likelihood of collocation for each pair of 
segments of the n-gram and determining a score for the n-gram based on a lowest 
calculated likelihood of collocation; 

identifying a set of n-grams having scores above a threshold; and 

adding the identified set of n-grams as compound tokens to the 
vocabulary and removing constituent tokens that occur in the added compound tokens 
from the vocabulary. 

61. (New) The program product of claim 60 where calculating a likelihood of collocation 
for each pair of segments of the n-gram comprises determining a likelihood ratio λ for 
each pair of segments that is computed in accordance with the formula: 

$$\lambda = \frac{L(H_i)}{L(H_c)}$$

where L(H_i) is a likelihood of observing H_i under an independence hypothesis, L(H_c) 
is a likelihood of observing H_c under a collocation hypothesis, and H is a pair of 
segments. 

62. (New) The program product of claim 61 where the L(H_c) is computed for each pair 
of segments, t_1, t_2, in each n-gram in accordance with the formula: 

$$\arg\max_{L(H_i)} \frac{L(t_1, t_2 \text{ form compound})}{L(n\text{-gram does not form compound})}$$

63. (New) The program product of claim 61 where, for each pair of segments, t_1, t_2, in 
each n-gram, the independence hypothesis comprises P(t_2 | t_1) = P(t_2 | ¬t_1) and the 
collocation hypothesis comprises P(t_2 | t_1) > P(t_2 | ¬t_1). 



64. (New) The program product of claim 60 where identifying a plurality of unique 
n-grams in the text comprises skipping n-grams appearing in a list of known 
compounds. 

65. (New) A system comprising: 

a computer readable medium including a program product; and 
one or more processors configured to execute the program product and 
perform operations comprising: 

extracting a vocabulary of tokens from text; 

iterating from n > 2 down to n = 2 where n decreases by one each iteration and 
in each iteration performing the actions of: 

identifying a plurality of unique n-grams in the text, each n-gram being 
an occurrence in the text of n sequential tokens, each token being found in the 
vocabulary; 

dividing each n-gram into n-1 pairs of two adjacent segments, where 
each segment consists of at least one token; 

for each n-gram, calculating a likelihood of collocation for each pair of 
segments of the n-gram and determining a score for the n-gram based on a lowest 
calculated likelihood of collocation; 

identifying a set of n-grams having scores above a threshold; and 

adding the identified set of n-grams as compound tokens to the 
vocabulary and removing constituent tokens that occur in the added compound tokens 
from the vocabulary. 

66. (New) The system of claim 65 where calculating a likelihood of collocation for each 
pair of segments of the n-gram comprises determining a likelihood ratio λ for each 
pair of segments that is computed in accordance with the formula: 

$$\lambda = \frac{L(H_i)}{L(H_c)}$$

where L(H_i) is a likelihood of observing H_i under an independence hypothesis, L(H_c) 
is a likelihood of observing H_c under a collocation hypothesis, and H is a pair of 
segments. 

67. (New) The system of claim 66 where the L(H_c) is computed for each pair of 
segments, t_1, t_2, in each n-gram in accordance with the formula: 

$$\arg\max_{L(H_i)} \frac{L(t_1, t_2 \text{ form compound})}{L(n\text{-gram does not form compound})}$$

68. (New) The system of claim 66 where, for each pair of segments, t_1, t_2, in each 
n-gram, the independence hypothesis comprises P(t_2 | t_1) = P(t_2 | ¬t_1) and the collocation 
hypothesis comprises P(t_2 | t_1) > P(t_2 | ¬t_1). 



69. (New) The system of claim 65 where identifying a plurality of unique n-grams in the 
text comprises skipping n-grams appearing in a list of known compounds. 



